Performance Analysis of k-NN on High Dimensional Datasets

Authors

  • Pradeep Mewada
  • Jagdish Patil
  • Tom M. Mitchell
  • Shailendra K. Shrivastava
  • Loris Nanni
  • Alessandra Lumini
  • A. K. Pujari
  • Gursel Serpen
  • Hamid Parvin
  • Hosein Alizadeh
  • Mohsen Moshki
  • Behrouz Minaei-Bidgoli
  • Naser Mozayani
Abstract

Research on classifying high dimensional datasets remains an open direction in pattern recognition. High dimensional feature spaces cause scalability problems for machine learning algorithms because the complexity of a high dimensional space increases exponentially with the number of features. Recently, a number of ensemble techniques using different classifiers have been proposed for classifying high dimensional datasets. The task of these techniques is to detect and exploit relevant patterns in the data for classification. The k-nearest neighbor (k-NN) algorithm is among the simplest of all machine learning algorithms. This paper discusses various ensemble k-NN techniques for high dimensional datasets, mainly: the Random Subspace Method (RSM), Divide & Conquer Classification and Optimization using GA (DCC-GA), the Random Subsample Ensemble (RSE), and Improving Fusion of Dimensionality Reduction (IF-DR). All of these approaches generate relevant subsets of features from the original set, and the result is obtained from the combined decision of an ensemble of classifiers. This paper presents an effective study of improvements on ensemble k-NN for the classification of high dimensional datasets. The experimental results show that these approaches improve the classification accuracy of the k-NN classifier.
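Of the techniques the abstract lists, the Random Subspace Method is the simplest to sketch: each base k-NN classifier sees only a random subset of the features, and the ensemble decides by majority vote. Below is a minimal illustration in plain NumPy; the function names, parameter values, and toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify one point by majority vote of its k nearest neighbors."""
    d = np.linalg.norm(X_train - x, axis=1)      # Euclidean distances
    nearest = y_train[np.argsort(d)[:k]]         # labels of the k closest points
    return np.bincount(nearest).argmax()         # majority vote

def rsm_knn_predict(X_train, y_train, x, n_estimators=10, subspace_dim=5,
                    k=3, rng=None):
    """Random Subspace Method: each base k-NN sees a random feature subset;
    the ensemble decision is a majority vote over the base classifiers."""
    rng = np.random.default_rng(rng)
    votes = []
    for _ in range(n_estimators):
        feats = rng.choice(X_train.shape[1], size=subspace_dim, replace=False)
        votes.append(knn_predict(X_train[:, feats], y_train, x[feats], k))
    return np.bincount(votes).argmax()

# toy high-dimensional data: class 0 near the origin, class 1 shifted
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 50)),
               rng.normal(2.0, 1.0, size=(30, 50))])
y = np.array([0] * 30 + [1] * 30)
print(rsm_knn_predict(X, y, np.full(50, 2.0), rng=1))  # query at the class-1 mean
```

Each base learner works in a 5-dimensional slice of the 50-dimensional space, which is exactly how RSM sidesteps the curse of dimensionality that the abstract describes.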


Similar articles

Comparison of MLP NN Approach with PCA and ICA for Extraction of Hidden Regulatory Signals in Biological Networks

Biologists now face masses of high dimensional datasets generated by various high-throughput technologies; these are the outputs of complex, interconnected biological networks operating at different levels and driven by a number of hidden regulatory signals. So far, many computational and statistical methods such as PCA and ICA have been employed for computing low-dimensional or hidden represe...
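As a reminder of what PCA contributes in this setting, a low-dimensional representation can be computed directly from the SVD of the centered data matrix. The sketch below is a generic illustration of that idea, not the method of the cited paper; the hidden-factor toy data is an assumption for demonstration.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # low-dimensional scores

# toy data: 20 observed variables driven by one hidden regulatory signal
rng = np.random.default_rng(0)
signal = rng.normal(size=(100, 1))               # the hidden factor
X = signal @ rng.normal(size=(1, 20)) + 0.1 * rng.normal(size=(100, 20))
Z = pca(X, 2)
print(Z.shape)                                   # (100, 2)
```

With low noise, the first principal component recovers the hidden signal up to sign and scale, which is the behavior the MLP/PCA/ICA comparison in the cited work examines.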


FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC (High-Performance Computing)), a similarity search system for ultra-high dimensional datasets on a single machine, which does not require similarity computation. Our system is an auspicious illustration of the power of randomized algorithms carefully tailored for high-performance computing platforms. We leverage LSH...
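The LSH family that systems like FLASH build on can be illustrated with signed random projections: cosine-similar vectors tend to receive similar bit signatures, so candidate neighbors can be found by comparing short codes instead of computing full similarities. This is a generic sketch of that one ingredient; FLASH's actual CPU-GPU pipeline is far more elaborate.

```python
import numpy as np

def signed_random_projections(X, n_bits, rng=None):
    """Hash each row of X to an n_bits signature via random hyperplanes;
    rows with high cosine similarity tend to collide on most bits."""
    rng = np.random.default_rng(rng)
    planes = rng.normal(size=(X.shape[1], n_bits))   # one hyperplane per bit
    bits = (X @ planes) > 0                          # sign pattern per row
    return bits.dot(1 << np.arange(n_bits))          # pack bits into one integer

rng = np.random.default_rng(0)
base = rng.normal(size=100)
near = base + 0.01 * rng.normal(size=100)            # almost the same direction
far = rng.normal(size=100)                           # unrelated direction
codes = signed_random_projections(np.vstack([base, near, far]), n_bits=16, rng=1)
# Hamming distance between codes approximates the angle between the vectors:
# `near` should share almost all bits with `base`, `far` should not.
```

The per-bit collision probability is 1 − θ/π for vectors at angle θ, which is why longer codes discriminate more sharply.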


L1-graph construction using structured sparsity

As a powerful model for representing data, graphs have been widely applied to many machine learning tasks. More notably, to address the problems associated with traditional graph construction methods, sparse representation has been successfully used for graph construction, and one typical work is the L1-graph. However, since the L1-graph often establishes only part of all the valuable connections betw...
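The core step of L1-graph construction is sparse coding: each sample is represented as a sparse combination of the remaining samples, and the nonzero coefficients become edge weights. Below is a minimal lasso solver via ISTA (proximal gradient) applied to that step; it is an illustrative assumption of the basic L1-graph recipe, and the structured-sparsity variant in the cited paper adds grouping penalties on top of this.

```python
import numpy as np

def lasso_ista(D, x, lam=0.1, n_iter=500):
    """Sparse-code x over the columns of dictionary D via ISTA:
    minimize 0.5 * ||x - D a||^2 + lam * ||a||_1."""
    L = np.linalg.norm(D, ord=2) ** 2            # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = D.T @ (D @ a - x)                    # gradient of the least-squares term
        z = a - g / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return a

# L1-graph idea: code each sample over all *other* samples; nonzeros become edges
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 30))
X[1] = X[0] + 0.01 * rng.normal(size=30)         # sample 1 nearly duplicates sample 0
D = X[1:].T                                      # dictionary: remaining samples as columns
a = lasso_ista(D, X[0], lam=0.2)
# the large coefficient on the near-duplicate column becomes a strong graph edge
```

Because the penalty drives most coefficients exactly to zero, the resulting graph is sparse by construction, which is the property that makes the L1-graph attractive.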


Modification of the Fast Global K-means Using a Fuzzy Relation with Application in Microarray Data Analysis

Recognizing genes with distinctive expression levels can help in the prevention, diagnosis, and treatment of diseases at the genomic level. In this paper, fast global k-means (fast GKM) is developed for clustering gene expression datasets. Fast GKM is a significant improvement of the k-means clustering method. It is an incremental clustering method which starts with one cluster. Iteratively ...
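The incremental idea behind global k-means is easy to sketch: start from a single cluster at the global mean, and at each step add as the new center the data point that guarantees the largest reduction in squared error, then refine with standard Lloyd iterations. The code below is a generic illustration of that scheme, not the fuzzy-relation modification of the cited paper; the toy data and parameter values are assumptions.

```python
import numpy as np

def lloyd(X, centers, n_iter=50):
    """Standard k-means (Lloyd) refinement from the given initial centers."""
    for _ in range(n_iter):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(len(centers)):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return centers

def fast_global_kmeans(X, k):
    """Incremental clustering: grow from one cluster, adding the data point
    whose insertion promises the largest squared-error reduction (the
    'fast' lower bound of global k-means)."""
    centers = X.mean(0, keepdims=True).copy()
    P = ((X[:, None, :] - X[None]) ** 2).sum(-1)     # pairwise squared distances
    for _ in range(1, k):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1).min(1)  # current errors
        gains = np.maximum(d[None, :] - P, 0.0).sum(1)  # guaranteed gain per candidate
        centers = np.vstack([centers, X[gains.argmax()]])
        centers = lloyd(X, centers)
    return centers

# two well-separated blobs: the second center should land in the far blob
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
centers = fast_global_kmeans(X, 2)
```

Because each stage starts from the previous solution, the method avoids the random-restart sensitivity of plain k-means, which is the property the fast GKM paper builds on.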


The distance function effect on k-nearest neighbor classification for medical datasets

INTRODUCTION K-nearest neighbor (k-NN) classification is a conventional non-parametric classifier that has been used as the baseline classifier in many pattern classification problems. It is based on measuring the distances between the test data and each of the training data to decide the final classification output. CASE DESCRIPTION Since the Euclidean distance function is the most widely us...
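The point about the distance function is easy to make concrete: "nearest" in k-NN is defined entirely by the metric, so swapping the metric changes the classifier. A small sketch with three common choices follows (illustrative assumptions only; the cited case study compares its own set of functions on medical datasets).

```python
import numpy as np

def knn_predict(X_train, y_train, x, k, metric):
    """k-NN classification where the distance function is pluggable."""
    d = np.array([metric(t, x) for t in X_train])
    return np.bincount(y_train[np.argsort(d)[:k]]).argmax()

euclidean = lambda a, b: np.sqrt(((a - b) ** 2).sum())   # L2
manhattan = lambda a, b: np.abs(a - b).sum()             # L1 (city block)
chebyshev = lambda a, b: np.abs(a - b).max()             # L-infinity

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (25, 10)), rng.normal(3, 1, (25, 10))])
y = np.array([0] * 25 + [1] * 25)
for m in (euclidean, manhattan, chebyshev):
    print(knn_predict(X, y, np.full(10, 3.0), 5, m))     # query at the class-1 mean
```

On well-separated data all three metrics agree; the interesting cases, as the case study notes, arise when features have different scales or heavy noise, where the choice of metric can shift the decision.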




Publication date: 2011